Abstract. Replication helps ensure that a genotype-phenotype association
observed in a genome-wide association (GWA) study represents
a credible association and is not a chance finding or an artifact due 
to uncontrolled biases. We discuss prerequisites for exact replication, 
issues of heterogeneity, advantages and disadvantages of different meth- 
ods of data synthesis across multiple studies, frequentist vs. Bayesian 
inferences for replication, and challenges that arise from multi-team 
collaborations. While consistent replication can greatly improve the 
credibility of a genotype-phenotype association, it may not eliminate 
spurious associations due to biases shared by many studies. Conversely, 
lack of replication in well-powered follow-up studies usually invalidates 
the initially proposed association, although occasionally it may point to 
differences in linkage disequilibrium or effect modifiers across studies. 
meta-analysis.
1. INTRODUCTION

Reproducibility has long been considered a key 
part of the scientific method. In epidemiology, where 
variable conditions are the rule, the repeated obser- 
vation of associations between covariates by differ- 
ent investigative teams, in different populations, us- 
ing different designs and methods is typically taken 
as evidence that the association is not an artifact 
[22], for two principal reasons. First, repeated ob- 
servation adds quantitative evidence that the asso- 
ciation is not due to chance alone; second, replica- 
tion across different designs and populations pro- 
vides qualitative evidence that the association is not 
due to uncontrolled bias affecting a single study. 
Moreover, accumulated evidence can provide more 
accurate estimates of the effect measures of the risk 
factor being studied and their uncertainty. 

Genetic epidemiology learned the importance of 
replication the hard way. Before the advent of genome- 
wide association (GWA) studies, most reported geno- 
type-phenotype associations failed to replicate. There 
were a number of reasons for these conflicting re- 
sults, including the following: inappropriate reliance 
on standard significance thresholds that did not take 
the low prior probability of association into account, 
small sample sizes, and failure to measure the same
variant(s) across different studies [23, 27, 64]. In re- 
sponse, the field moved toward more stringent re- 
quirements for reporting associations, explicitly em- 
phasizing replication [7]. Many high-profile journals 
now will not publish genotype-phenotype associa- 
tions without concrete evidence of replication [1].

In this article we review the requirements for repli- 
cating associations discovered via GWA studies in 
light of recent developments: in particular, the in- 
creasing role of the consortia of multiple GWA stud- 
ies. Prospective meta-analysis of multiple genome- 
wide studies (conducted by different investigative 
teams, in different populations, using different tech- 
nologies and different designs) can satisfy the re- 
quirement for replication in the context of gene dis- 
covery, without the need to genotype yet more sam- 
ples in yet further studies, as long as the combined 
evidence for association is strong and consistent [71]. 
This is an important point, since very large sample 
sizes are required to reliably identify common vari- 
ants with modest effects, and formal replication of 
an association (that is, genotyping the initially
discovered genetic variant in a new, completely
independent sample of sufficient size) may be too
expensive in terms of time, money and available samples.
Indeed, for some rare diseases (e.g., Creutzfeldt- 
Jakob disease) or relatively uncommon diseases (e.g., 
pancreatic cancer), most if not all samples with 
readily-available DNA may be genotyped as part of 
initial GWA studies used at the discovery stage. 

We describe the goals of replication and statistical 
rules of thumb for distinguishing chance from true 
associations in the first section of this article. We 
then discuss the importance of exact replication — 
seeing a consistent association with the same risk al- 
lele using the same analytic methods across multiple 
studies — and describe analytic methods for combin- 
ing evidence across multiple studies, along with their 
relative advantages and disadvantages. We close by 
discussing why an association may fail to replicate 
and place replication efforts in the wider picture of 
contemporary genetic epidemiology, with its focus 
on large-scale collaborations and data sharing. 

2. GOALS OF REPLICATION 

There are two primary reasons replication is es- 
sential to confirm associations discovered via GWA 
studies: to provide convincing statistical evidence 
for association, and to rule out associations due to 
biases. Another possible aim of replication is to im- 
prove effect estimation. 



2.1 Convincing Statistical Evidence for 
Association 

To date most individual GWA studies do not have 
enough power to detect true associations at the con- 
servative significance levels necessary to distinguish 
false positives from false negatives. This point has 
typically been made by referencing the large number
of tests conducted in a GWA study and the consequent
severe adjustment of the p-value threshold
in order to control experiment-wide Type I error 
rate. Empirical estimates of the threshold needed 
to preserve the genome-wide Type I error rate in 
studies of European-ancestry subjects using current 
genotyping arrays range from $5 \times 10^{-7}$ to $1 \times 10^{-8}$
[18, 24, 51, 59]. These thresholds are different in 
other populations; for example, they are even lower 
in African or African-American samples, due to the
greater genetic diversity in these populations. Even 
these stringent thresholds take into account only the 
complexity of the genetic architecture, and they do 
not adjust for the potential complexity of the pheno- 
typic architecture, that is, when targeting multiple 
phenotypes. 
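
As a rough illustration of why the threshold falls as the number of effectively independent tests grows, the sketch below applies a simple Bonferroni-style division. The function name and the numbers of independent tests are hypothetical and chosen only for illustration; the empirical thresholds cited above come from the referenced studies, not from this calculation.

```python
def bonferroni_threshold(alpha: float, n_effective_tests: float) -> float:
    """Per-test p-value threshold that keeps the experiment-wide Type I error near alpha."""
    if n_effective_tests <= 0:
        raise ValueError("n_effective_tests must be positive")
    return alpha / n_effective_tests


# Hypothetical counts of effectively independent tests (illustrative assumptions only).
# Populations with greater genetic diversity (weaker LD) imply more independent tests
# and therefore a lower per-test threshold.
for label, m_eff in [("fewer independent tests", 1e5), ("more independent tests", 5e6)]:
    print(label, bonferroni_threshold(0.05, m_eff))
```

The same division can be extended to multiple phenotypes by multiplying the effective number of tests, although correlated phenotypes make the effective count hard to pin down.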

In the framework of the Bayes theorem, the prob- 
ability that an observed association truly exists in 
the sampled population depends not only on the observed
p-value for association, but also on the power
to detect the association (a function of minor allele
frequency, effect size and sample size), the prior
probability that the tested variant is associated with 
the trait under study, and the anticipated effect size 
[27, 64, 65]. We illustrate this in Figure 1, where we 
plot the Bayes Factor for association (versus no association)
as a function of p-value, sample size and
minor allele frequency [29]. The Bayes Factor is the 
ratio of the probability of the data under the alterna- 
tive hypothesis (association with the tested variant) 
to the probability of the data under the null hypoth- 
esis (no association). (Others define the Bayes Fac- 
tor as the inverse of this ratio [29].) The posterior 
odds of true association given the data are equal 
to the Bayes Factor times the prior odds of asso- 
ciation. In Figure 1 the dashed line represents the 
Bayes Factor needed to achieve posterior odds for an 
association of 3 : 1, assuming prior odds of associa- 
tion of 1 : 99,999 (i.e., roughly 100 out of 10,000,000 
variants are truly associated with the studied trait). 
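
The arithmetic behind the dashed line can be checked directly from the relation posterior odds = Bayes Factor × prior odds. The short sketch below uses exact fractions; the function names are illustrative and not taken from [29].

```python
from fractions import Fraction


def required_bayes_factor(target_posterior_odds: Fraction, prior_odds: Fraction) -> Fraction:
    """Bayes Factor needed so that (Bayes Factor x prior odds) reaches the target posterior odds."""
    return target_posterior_odds / prior_odds


def posterior_odds(bayes_factor: Fraction, prior_odds: Fraction) -> Fraction:
    """Posterior odds of true association given the Bayes Factor and the prior odds."""
    return bayes_factor * prior_odds


prior = Fraction(1, 99_999)   # prior odds of association, 1 : 99,999
target = Fraction(3, 1)       # desired posterior odds, 3 : 1
bf_needed = required_bayes_factor(target, prior)
print(bf_needed)  # 299997
```

Under these prior odds, a Bayes Factor of roughly 300,000 is needed before an association is three times more likely to be true than false.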

Note that not all p-values are created equal: for a
given p-value, the evidence for association increases 
with increasing sample size and depends on risk al- 
lele frequency. Increasing overall sample size not only 
Fig. 1. The relationship between the Bayes Factor and p-value for different sample sizes and minor allele frequencies (left
panel: minor allele frequency of 40%; right panel: 5%). The dashed line represents the Bayes Factor necessary to achieve
posterior odds in favor of association of 3 : 1 or greater, assuming the prior odds of association are 1 : 99,999. Bayes Factors were
calculated for a case-control study with equal numbers of cases and controls, assuming the expected value of the absolute value
of the log odds ratio is log(1.15), and assuming a "spike and smear" prior. Calculations use equations (4) and (5) from [29],
with $\sigma^2 = I_{\beta\beta} - I_{\beta\alpha} [I_{\alpha\alpha}]^{-1} I_{\alpha\beta}$, where $I$ is the Fisher information from simple logistic regression $\log(\mathrm{odds}) = \alpha + \beta G_{\mathrm{additive}}$
calculated under the null ($\beta = 0$).



increases power to detect common risk variants with 
modest effects, but it can also increase the cred- 
ibility of the observed associations. Differences in 
credibility by sample size for similar p-values are a
consequence of the fact that these calculations take 
assumptions about the expected magnitude of the 
true allelic odds ratio into account. In particular, 
they assume that the true allelic odds ratio is un- 
likely (probability < 2.1%) to be smaller than 0.5 
or bigger than 2. Since small p-values can only be
achieved in small sample sizes if the estimated ef- 
fect is large, these results are perceived to be less 
credible in this framework. 
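
The dependence of credibility on sample size at a fixed p-value can be sketched numerically. The example below is not the "spike and smear" calculation from [29] used for Figure 1; it swaps in a simpler Wakefield-style approximate Bayes Factor with a normal prior on the log odds ratio, scaled so that the expected absolute log odds ratio is log(1.15). The variance formula assumes an additive model, Hardy-Weinberg proportions and equal numbers of cases and controls. All function and parameter names are illustrative, and the numbers will not match Figure 1.

```python
import math
from statistics import NormalDist


def approx_bayes_factor(p_value, n_cases, maf, mean_abs_log_or=math.log(1.15)):
    """Approximate Bayes Factor (association vs. no association) for a two-sided p-value.

    Assumptions (illustrative, not from the source): normal prior N(0, w) on the
    log odds ratio, and n_cases cases plus n_cases controls under an additive model.
    """
    # z statistic implied by the two-sided p-value
    z = NormalDist().inv_cdf(1.0 - p_value / 2.0)
    # Approximate null variance of the estimated log odds ratio
    v = 1.0 / (n_cases * maf * (1.0 - maf))
    # Prior variance chosen so that E|beta| = mean_abs_log_or under N(0, w)
    w = (mean_abs_log_or * math.sqrt(math.pi / 2.0)) ** 2
    return math.sqrt(v / (v + w)) * math.exp(z * z * w / (2.0 * (v + w)))


# Same p-value, different sample sizes (cases = controls), minor allele frequency 40%
for n in (500, 1000, 2000):
    print(n, f"{approx_bayes_factor(1e-7, n, 0.40):.3g}")
```

Over the sample sizes shown, the same p-value carries more evidence in the larger study, because a small study can only reach that p-value with an estimated effect that the prior regards as implausibly large. With this particular normal prior the trend does not continue indefinitely at extremely large sample sizes, which is one reason the choice of prior matters.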

In other words, the Bayes Factor, and thus the
credibility of an association, depends explicitly on
what we assume for the typical magnitude of likely
genetic effects. For example, if we assume that the
average effect is not an odds ratio of 1.15 as in Figure 1,
but an odds ratio of $\mathrm{OR}_{av} = 1.5$, then the
prior odds of association will be lower, because fewer
variants (with larger effects than in the $\mathrm{OR}_{av} = 1.15$
scenario) would suffice to explain the genetic
variability. A larger Bayes Factor would be needed to
reach a 3 : 1 posterior. Moreover, large effects emerging
from small studies will be more credible than in
the $\mathrm{OR}_{av} = 1.15$ scenario, while very small effects
emerging with similar p-values from large studies
will be less credible [29].



Conversely, if we assume that the average effect 
is an odds ratio of $\mathrm{OR}_{av} = 1.02$ (consistent with the
theory of infinitesimal effects, each having an almost 
imperceptible contribution [14]), then the prior odds 
of association will be much higher, because a much 
larger set of (infinitesimally) associated variants are 
anticipated and a smaller Bayes Factor would be 
needed to reach a 3 : 1 posterior. Moreover, large effects
emerging from small studies would not be credible,
regardless of their p-value, while very small effects
emerging with modest p-values from large studies
would provide credible evidence for association.

We should acknowledge that the distribution of ef- 
fect sizes of true associations is unknown, and there 
is no guarantee that they would be similar for dif- 
ferent traits. The difficulty of arriving at the true 
causal variants (which may have larger effect sizes 
than their markers) adds another layer of complex- 
ity. Moreover, given simple power considerations, it 
is expected that a large proportion of the large effects
have been identified, while only a small proportion
of the smaller effects and a negligible proportion
of the tiny and infinitesimal effects have already
been discovered. With these caveats, most evidence
from GWA studies to date is more compatible with
a scenario in which $\mathrm{OR}_{av}$ is in the range of 1.15
[45]. However, the 1.02 scenario is not implausible, and
for some traits the 1.5 scenario may be operating
even though the true variants have not yet been identified.

Many research groups cannot afford to genotype 
the large sample sizes needed to reliably detect ge- 
netic markers that are weakly associated with a trait 
using a genome-wide platform. This has sparked in- 
terest in multistage designs, where a subset of avail- 
able samples are genotyped using the genome-wide 
platform, and then a subset of the "most promis- 
ing" markers (typically those with lowest p-values) 
are genotyped using a custom platform. These de- 
signs are reviewed in more detail elsewhere in this 
issue [60]. We should note that the primary motiva- 
tion of multistage designs is not to increase power 
by testing fewer hypotheses in the second stage sam- 
ples, and hence paying a smaller penalty for multi- 
ple testing at the second stage. Rather the primary 
goal of multistage designs is to save genotyping costs 
or maximize power given a fixed genotyping budget.
If genotyping costs were not an issue, the
multistage approach would be less powerful than simply
testing all markers in the entire available sample 
[33, 58, 60]. As genotyping costs decrease, and as 
more samples have been genotyped as part of pre- 
vious GWA analyses, single-stage analyses become 
more common [33]. 

The appropriate threshold for claiming associa- 
tion depends also on the context and the relative 
costs for false positive and false negative results. For 
example, re-sequencing a region and conducting in 
vivo and in vitro functional studies is quite expen- 
sive, and will require convincing evidence that the 
observed association is true. On the other hand, in- 
cluding a region in a predictive genetic risk score is 
relatively inexpensive, so a less stringent threshold 
might suffice. This approach to replication is intu- 
itively Bayesian (although it need not use formal 
Bayesian methods): each successive study serves to 
update the prior for association in subsequent stud- 
ies. 

2.2 Ruling Out Association Due to Artifact 

Even when the initial association is unlikely to be 
a stochastic artifact due to multiple testing, it may 
still be an artifact due to bias. For common variants,
the anticipated effects are modest (for binary
traits, odds ratios smaller than 1.5; for continuous
traits, percent variance explained less than 0.5%)
and very similar in magnitude to the subtle biases
that may affect genetic association studies, most
notably population stratification bias. For this reason,
it is important to see the association in other
studies conducted using a similar (but not identical) 
study base. In principle, careful design and anal- 
ysis should eliminate or greatly reduce bias due to 
population stratification in association studies using 
unrelated individuals [17, 44, 52] — and, in practice, 
these methods have effectively removed some wor- 
rying systematic inflation in association statistics 
[19]. Family-based designs can provide additional 
evidence that an observed association is not due 
to population stratification bias, but these designs 
are not cost-efficient, and have their own unique 
sources of bias. For example, nondifferential genotyping
error can inflate Type I error rates in some
family-based analyses, although it does not change
the Type I error rate in case-control analyses [47].

2.3 Improving Effect Estimates 

Another reason to conduct replication studies is 
to extend the generalizability of the association. It
is important to know if the association exists and 
has similar magnitude in different environmental or 
genetic backgrounds. It is particularly interesting to 
know how these associations play out in populations 
of non-European ancestry, considering most GWA 
studies to date have been conducted in European- 
ancestry samples. Differences in allele frequencies 
and local linkage disequilibrium (LD) patterns across 
populations present both challenges and opportuni- 
ties for replication and fine mapping. On the one 
hand, a marker allele that is strongly associated with 
a trait in one population may not have a detectable 
association in another, as the allele frequency may 
be smaller or the LD with the (unknown) causal 
variant may be much weaker. Thus, initial replica- 
tion studies should focus on populations with ge- 
netic ancestry similar to that sampled in the study 
that first observed the marker-trait association, us- 
ing the exact strategy outlined in the next section. 
Once credible evidence for this association has been 
established, replication efforts in other populations 
should type not only the marker known to be asso- 
ciated in the original population, but other markers 
that "tag" common variation in a region surround- 
ing the marker. For fine mapping, differences in LD 
patterns across populations — notably the lower lev- 
els of LD in African-ancestry populations — might 
lead to refined estimates of the position of causal 
variants [55, 62]. 
Replication may also be useful in identifying a
more reliable estimate of the effect size for the asso- 
ciation. Signals selected based on statistical signifi- 
cance thresholds in underpowered settings are likely 
to have (on average) inflated effects due to the win- 
ner's curse phenomenon [28, 69, 70, 75]. Replication 
should take this into account during the sample size 
calculations for the replication efforts; the effect es- 
timate from the initial study may be inflated, lead- 
ing to an under-estimate of the number of subjects 
needed to reliably detect it [69, 70, 75, 76]. Analytic 
methods are available to adjust for winner's-curse 
bias, but studying the marker in additional samples 
(beyond those used to initially identify the marker) 
will help produce more unbiased estimates of the ge- 
netic effect. Accurate estimates of marker risks are 
important (even if the marker is only a surrogate for 
the as yet unknown causal variant), as they may be 
used for personalized predictive purposes [34, 68]. 
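
The winner's curse can be illustrated with a small simulation: draw many estimates of the same true effect, keep only those that pass a stringent significance threshold, and compare their average with the truth. The true odds ratio, standard error and threshold below are hypothetical values chosen for illustration, and the function name is not from the cited references.

```python
import math
import random


def simulate_winners_curse(true_log_or, std_error, z_threshold, n_sim=200_000, seed=1):
    """Return (mean estimate among 'significant' replicates, number of such replicates)."""
    rng = random.Random(seed)
    selected = []
    for _ in range(n_sim):
        # Estimated log odds ratio from one hypothetical underpowered study
        estimate = rng.gauss(true_log_or, std_error)
        # Keep the estimate only if it passes the significance threshold
        if abs(estimate) / std_error >= z_threshold:
            selected.append(estimate)
    if not selected:
        return float("nan"), 0
    return sum(selected) / len(selected), len(selected)


true_log_or = math.log(1.15)
mean_selected, n_selected = simulate_winners_curse(
    true_log_or, std_error=0.05, z_threshold=5.33
)
print(
    f"true OR = 1.15; mean OR among {n_selected} significant replicates = "
    f"{math.exp(mean_selected):.2f}"
)
```

Because only estimates far above the truth can cross the threshold in this setting, a replication study powered for the discovery estimate would be too small, which is the point made above about sample size calculations.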

Finally, when there are several putative association
signals in a region of high LD, dense genotyping
in replication studies may help elucidate whether
they represent independent loci, each with its own
effect on the trait, or whether one or all are "passenger"
markers, which have no effect conditional on
the true underlying causal variant. Detailed discussion
of fine mapping issues is beyond the scope of
this review, but in light of the effort involved, such 
"fine mapping" efforts should arguably be reserved 
for loci with credible evidence for association, for ex- 
ample, loci with markers that have been replicated 
exactly, as discussed in the next section [8].

3. PREREQUISITES FOR EXACT 
REPLICATION OF A PUTATIVE 
ASSOCIATION FROM A GWA STUDY 

One of the early difficulties in replicating genetic 
associations observed in candidate gene studies was 
the fact that different groups would study different 
markers in the same region. Because the LD among 
these markers was poorly understood, results from 
multiple studies could increase rather than decrease 
confusion. The initial study may have seen an asso- 
ciation with SNP A, but the second study did not 
genotype that SNP, and instead saw an association 
with SNP B, which was not genotyped in the original 
study. As the number of SNPs typed per region in- 
creased, "moving the goalposts" in this fashion con- 
tributed to the problem of persistent false positives 
in the candidate gene literature; by chance, some
SNP in the region (not necessarily the SNP that was
statistically significant in other studies) would have 
p < 0.05, and this would be (incorrectly) proclaimed 
replication [49]. In response to this problem, guide- 
lines for replication in genetic association studies 
now call for exact replication. The same marker — 
or, if technical difficulties preclude this, a perfect or 
near-perfect proxy for the original marker — should 
be genotyped across all studies and analyzed using 
the same genetic model. In this section we discuss 
prerequisites for exact replication. We use the term 
"exact replication" cautiously, recognizing that this 
is an unattainable goal in epidemiology (e.g., stud- 
ies conducted by different investigators at different 
times, let alone places, will sample from different 
populations) and that in some sense it is the "inex- 
actness" of replication studies that increases credi- 
bility of the observed association (it is less likely to 
be an artifact due to a bias that is unique to the 
initial study). We use the term to emphasize the 
danger of "moving the goalposts" so far that claims 
of replication carry little weight. 

3.1 Test the Same Marker 

This should be done preferably by directly geno- 
typing this marker. Currently-available imputation 
methods are powerful and quite accurate for filling 
in information on missing common SNPs [12, 38, 
39, 46, 50]. Even then, further confirmation by di- 
rect genotyping would be very useful. (In fact, to 
rule out technical artifact, some have argued that 
an associated SNP should be genotyped using two 
different genotyping technologies, or that a second 
SNP in the region that is in [near-] perfect LD with 
the associated SNP be genotyped [7].) Great caution 
is needed when "replicating" an association by find- 
ing an association with a (different) nearby marker: 
if the new marker does not have perfect or almost 
perfect LD with the previously discovered one, this 
cannot be considered replication. Moreover, even for 
markers with seemingly perfect LD in a given sam- 
ple, the LD may be far less than perfect in a different 
population and it may break completely in popula- 
tions of different ancestry. When a panel of markers 
spanning the whole locus is pursued (e.g., after rese- 
quencing and fine mapping), different markers and 
haplotypes may be found to be associated in dif- 
ferent populations. Evidence from different markers 
and haplotypes should not be combined in the same 
meta-analysis. The consistency of each association 
can be formally assessed separately (see the section 
on statistical heterogeneity). 






P. KRAFT, E. ZEGGINI AND J. P. A. IOANNIDIS 



3.2 Use the Same Analytic Methods 

If the initial results found an increased risk per 
copy of, say, the A allele (additive model), then a 
significant increased risk for carriers of the T allele 
(dominant model, in other direction) does not con- 
stitute replication. It is in principle possible that the 
direction of association can change due to differences 
in linkage disequilibrium across study populations. 
However, this "flip flop" phenomenon can occur only 
in very specific situations that are unlikely when the 
study populations have similar continental ancestry 
[41]. The burden of proof is on investigators to show 
evidence for how difference in LD in their study pop- 
ulations could produce a "flip flop" if they wish to 
claim replication, even though different alleles are 
associated with risk. Merely citing the possibility of 
"flip-flopping" does not suffice. 

Other analytical options include the statistical 
model (e.g., for a binary outcome, whether it is 
treated as simply yes / no or the time-to-event is also 
taken into account), the use of any covariates (e.g.,
for age, gender or topic-specific variables) and the
use of corrections for relatedness. Usually, the impact
of these options is not major, but it can make
a difference for borderline associations, which may
seem to pass or not pass a desired p-value threshold.
This means that both for GWA studies and subse- 
quent investigations, one should carefully report the 
methods in sufficient detail so they can be indepen- 
dently replicated by other researchers [42]. 

Modeling can have a much more profound impact
in more complex associations that go beyond
single markers, for example, with approaches that
try to model dozens or hundreds of gene variants
that form a "pathway" [5, 35]. Such complex models
may be built by multifactor dimensionality reduction (MDR),
kernel machines, stepwise logistic regression or a diversity
of other methods, and it is important for the replication
process to use exactly the same steps as the model building.
Even then, because these models are so flexible,
it is unclear whether a "significant" finding in a
second data set constitutes replication; the associa- 
tion may be driven by different sets of SNPs in the 
different studies. Researchers who conduct complex 
model-selection/model-building analyses should re- 
port their "final" model in as much detail as possi- 
ble, so other investigators can judge the fit of that 
model in other data sets. 



3.3 Try to Use the Same Phenotype 

For many traits, phenotype definitions may vary 
considerably across studies, or there may be many 
different options for defining the phenotypes of in- 
terest within each study. Some of this variability is 
unavoidable and results from differences in measure- 
ment protocols across studies. For example, disease 
may be self-reported in some studies or clinician- 
diagnosed in others; waist:hip ratios may be self- 
reported or measured in a clinic, using different op- 
erational definitions of "waist"; etc. Characteristics 
of studied phenotype may also differ across stud- 
ies: for example, because of the widespread use of 
Prostate-Specific Antigen (PSA) screening in the 
United States since the early 1990s, the proportion 
of early-stage prostate cancer cases in the US is 
higher than in Europe, where PSA screening is not 
as common. In the context of a prospective meta- 
analysis, study investigators can discuss these issues 
and reach consensus on how to define phenotype so 
as to maximize relevant information while ensur- 
ing as many studies can provide data as possible. 
In general, there is a trade-off between more accurate
(but more expensive and perhaps more invasive)
measurements on fewer people and less accurate
(but cheaper) measurements on more people.
For example, although the Fagerstrom Test may be
a "gold standard" measure of nicotine dependence, 
currently only a few studies with available genome- 
wide genotype data have collected data on this test; 
on the other hand, many studies have collected in- 
formation about the number of cigarettes smoked 
per day (a component of the Fagerstrom score) [6]. 
To maximize sample size, investigators may agree to 
analyze cigarettes per day (which then raises further 
issues such as what scale to use, whether and how 
to transform the raw data, how to reconcile contin- 
uous with categorical data, etc.). Prospective meta- 
analyses for height, BMI and fasting glucose have 
dealt with the issue of phenotype harmonization on
a trait-by-trait basis [37, 43, 53, 67]. Other consortia
and projects such as the Public Population Project 
in Genomics (http://www.p3gconsortium.org/) 
and PhenX (www.phenx.org) aim to facilitate broad 
collaboration among existing and future genome- 
wide association studies by making recommenda- 
tions for standard phenotyping protocols for many 
diseases and traits. Still, despite best efforts to har- 
monize measures, some measurement differences 
across studies will persist, and investigators should 
be aware of these as possible sources of heterogeneity
(see Sections 4 and 5).

Requiring that replication studies use the same 
phenotype definition used in the initial study also 
helps avoid false positives due to "data dredging," 
the temptation to generate small p-values by testing 
many different traits (different case subtypes, continuous
traits dichotomized using different, arbitrary
cut points, etc.) [27]. When many phenotypes or 
phenotype definitions and analyses are used, there 
should be a penalty for multiple testing. Applying 
this penalty is not always straightforward, given that 
most of the phenotypes and analyses are usually cor- 
related or even highly correlated. However, the dan- 
ger exists for an association to be claimed replicated, 
after searching through repeated modifications of 
the phenotypes and analyses thereof. A p-value that
has been obtained through such an iterative search- 
ing path is not the same as one that was obtained 
from a single main analysis of a single phenotype. 

4. REPLICATION METHODS AND 
PRESENTATION OF RESULTS 

4.1 Statistical Heterogeneity Across Datasets 

There are several tests and metrics of between- 
dataset heterogeneity, borrowed from applications 
of meta-analysis in other fields. The most popular 
are Cochran's Q test of homogeneity [9], the $I^2$ metric
[obtained by (Q - degrees of freedom)/Q] and the
between-study variance estimator $\tau^2$ [21]. There are
shortcomings to all of them [26]. The Q test is underpowered
in the common situation where there are
few datasets and may be overpowered when there 
are many, large datasets. There are now readily- 
available approaches that can be used to compute 
the power of the Q test to detect a given $\tau^2$
[4]. When the Q test is underpowered, the $I^2$ metric
has large uncertainty and this can be readily visualized
by computing its 95% confidence intervals
[30]. Similarly, estimates of $\tau^2$ may have large uncertainty.
One potentially useful approach may be to
estimate the magnitude of between-study variability
compared with the observed effect size $\theta$, that
is, $h = \tau/\theta$. For a small effect size, even a small $\tau^2$
may question the generalizability of the conclusion 
that there is an association across all datasets. This 
conclusion would not be as easily challenged in the 
presence of a large effect size. 
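
The heterogeneity metrics discussed above can be computed from per-study effect estimates and standard errors. The sketch below makes two assumptions that the text does not specify: the between-study variance is estimated with the DerSimonian-Laird moment estimator, and the fixed-effect pooled estimate is used as the observed effect size in the ratio $h = \tau/\theta$. Function and key names are illustrative.

```python
import math


def heterogeneity_summary(estimates, std_errors):
    """Cochran's Q, I^2, a moment estimate of tau^2, and h = tau / |pooled effect|."""
    if len(estimates) != len(std_errors) or len(estimates) < 2:
        raise ValueError("need at least two studies with matching standard errors")
    # Inverse-variance weights and fixed-effect pooled estimate
    weights = [1.0 / (se * se) for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_weight
    # Cochran's Q: weighted squared deviations from the pooled estimate
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, estimates))
    df = len(estimates) - 1
    # I^2 = (Q - df) / Q, truncated at zero
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    # DerSimonian-Laird moment estimator of tau^2 (assumed here), truncated at zero
    scale = total_weight - sum(w * w for w in weights) / total_weight
    tau_squared = max(0.0, (q - df) / scale)
    h = math.sqrt(tau_squared) / abs(pooled) if pooled != 0 else math.inf
    return {"pooled": pooled, "Q": q, "df": df, "I2": i_squared,
            "tau2": tau_squared, "h": h}


# Three hypothetical studies reporting log odds ratios with equal standard errors
print(heterogeneity_summary([0.0, 0.2, 0.4], [0.1, 0.1, 0.1]))
```

With only a handful of studies, each of these quantities is estimated imprecisely, so they are best reported with their uncertainty rather than as point values.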



Some other caveats should be mentioned. The win- 
ner's curse in the magnitude of the effect in the dis- 
covery phase may introduce spuriously inflated het- 
erogeneity, when the discovery data are combined 
with subsequent replication studies. In such two- 
stage approaches, between-study heterogeneity 
should best be estimated excluding the discovery 
data. Conversely, if all datasets are measured with 
genome-wide platforms and GWA scan meta-analysis
is performed in all gene variants, this is no longer 
an issue. In fact, if the GWA scan meta-analysis 
uses random effects (see below), the emerging top 
hits from the GWA scan meta-analysis are likely 
to have, on average, deflated observed heterogene- 
ity compared with the true heterogeneity. This is 
because underestimation of the between-study het- 
erogeneity favors a variant to come to the top of the 
list, since it does not get penalized by wider confi- 
dence intervals in the random effects setting. 
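
The penalty that random effects impose through wider confidence intervals can be seen by computing both pooled estimates for the same data. This is a minimal sketch assuming inverse-variance weighting and the DerSimonian-Laird estimate of the between-study variance; the text does not prescribe a particular estimator, and the names below are illustrative.

```python
import math


def fixed_and_random_effects(estimates, std_errors):
    """Pooled estimate and standard error under fixed-effect and random-effects models."""
    weights = [1.0 / (se * se) for se in std_errors]
    total_weight = sum(weights)
    fixed = sum(w * b for w, b in zip(weights, estimates)) / total_weight
    fixed_se = math.sqrt(1.0 / total_weight)
    # Between-study variance from Cochran's Q (DerSimonian-Laird, assumed here)
    q = sum(w * (b - fixed) ** 2 for w, b in zip(weights, estimates))
    df = len(estimates) - 1
    scale = total_weight - sum(w * w for w in weights) / total_weight
    tau_squared = max(0.0, (q - df) / scale) if scale > 0 else 0.0
    # Random-effects weights add tau^2 to each study's sampling variance
    re_weights = [1.0 / (se * se + tau_squared) for se in std_errors]
    re_total = sum(re_weights)
    random_est = sum(w * b for w, b in zip(re_weights, estimates)) / re_total
    random_se = math.sqrt(1.0 / re_total)
    return {"fixed": (fixed, fixed_se), "random": (random_est, random_se),
            "tau2": tau_squared}


# Three hypothetical studies with heterogeneous log odds ratios
result = fixed_and_random_effects([0.0, 0.2, 0.4], [0.1, 0.1, 0.1])
for model in ("fixed", "random"):
    est, se = result[model]
    print(model, f"estimate={est:.3f}", f"z={est / se:.2f}")
```

When the estimated between-study variance happens to be zero, the two models coincide, so a variant whose heterogeneity is underestimated by chance escapes the penalty and ranks higher.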

However, we caution that when the number of 
studies is relatively small, association tests based 
on random-effects meta-analysis may be deflated, as 
the between-study variance $\tau^2$ will be poorly estimated.
This is illustrated in Figure 2, which shows
quantile-quantile plots for fixed-effect and random- 
effects meta-analyses of data from PanScan collabo- 
ration, which involves 13 studies in the initial GWAS 
scan. For the random effects analysis, the genomic- 
control "inflation factor" is in this case more aptly 
named a "deflation factor": $\lambda_{GC} = 0.84$, indicating
that the random effects p-values are larger than expected
under the assumption that the vast majority
of SNPs are not associated with pancreatic cancer.
Fixed-effect meta-analysis is arguably more appropriate
as an initial screening test for associated
markers, although because fixed-effect analysis can 
be highly significant when only one (relatively large) 
study shows evidence for association, analyses that 
incorporate effect heterogeneity such as random ef- 
fects meta-analysis should be reported for highly sig- 
nificant markers from fixed-effect analyses. 
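
The genomic-control factor referred to above is the median observed 1-df chi-squared statistic divided by the median of a chi-squared distribution with one degree of freedom. The sketch below computes it from a list of two-sided p-values using only the standard library; the function names are illustrative, and the p-values in the usage lines are made up.

```python
from statistics import NormalDist, median

# Median of a chi-squared distribution with 1 df: the square of the
# 75th percentile of the standard normal (about 0.455)
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def chi2_from_p_values(p_values):
    """Convert two-sided p-values to 1-df chi-squared statistics."""
    nd = NormalDist()
    return [nd.inv_cdf(1.0 - p / 2.0) ** 2 for p in p_values]


def genomic_control_lambda(chi2_statistics):
    """Median observed chi-squared statistic over the expected median under the null."""
    return median(chi2_statistics) / CHI2_1DF_MEDIAN


# P-values centered on 0.5 give a factor near 1; p-values shifted toward 1 give a factor below 1
uniform_like = [0.1 * i for i in range(1, 10)]
shifted_up = [0.3, 0.5, 0.6, 0.7, 0.9]
print(genomic_control_lambda(chi2_from_p_values(uniform_like)))
print(genomic_control_lambda(chi2_from_p_values(shifted_up)))
```

A value below 1, as in the second case, corresponds to the "deflation" described for the random-effects analysis: the p-values are systematically larger than expected under the null.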

Finally, lack of demonstrable heterogeneity may
be perceived as a criterion of credible replication
[31]. However, one should note that tests and measures
of heterogeneity address whether effect sizes
across different datasets vary, not whether they are 
consistently on the same side of the null. Dataset- 
specific effects could vary a lot, but they may all 
still point to the same direction of effect. Given the 
potential diversity of LD structure across popula- 
tions, and differences in phenotype definitions and 
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O Fixed: A = 1.002 
O Random: A = 0.839 



Q 




— I 1 1 1 1 1 1 

□ 1 2 3 4 5 6 

Expected 

Fig. 2. Quantile-quantile plots for fixed-effect and random- 
effect meta-analyses of the 13 studies in the initial PanScan 
genome-wide association study of pancreatic cancer. The ge- 
nomic control inflation factors Agc for the fixed-effect and 
random effect analyses were 0.84 an d 1.00, respectively. Agc 
was calculated as the median observed chi-squared test statis- 
tic divided by the median of a chi-squared distribution with 
one degree of freedom. 

measurements across studies, between-study hetero- 
geneity should not dismiss an association because 
the effect sizes are not consistent, if the evidence for 
rejection of the null hypothesis is strong. 
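The distinction between heterogeneity and consistency of
direction can be illustrated with a small sketch. It
assumes Cochran's Q and the I-squared statistic as the
test and measure of heterogeneity (the text does not name
specific ones), and the effect estimates are made up for
illustration.

```python
def heterogeneity(effects, std_errors):
    """Return Cochran's Q and I-squared for inverse-variance weighted effects."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I-squared: share of the variation in estimates beyond chance, floored at 0.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, i_squared


# Hypothetical log odds ratios from four datasets: very different in size,
# yet every one of them lies on the same side of the null (log OR > 0).
log_odds_ratios = [0.05, 0.15, 0.30, 0.45]
std_errors = [0.04, 0.04, 0.04, 0.04]

q_stat, i_squared = heterogeneity(log_odds_ratios, std_errors)
same_direction = all(b > 0 for b in log_odds_ratios)

print(f"Q = {q_stat:.1f} on {len(log_odds_ratios) - 1} df")
print(f"I^2 = {i_squared:.2f}")
print(f"all effects in the same direction: {same_direction}")
```

Here Q far exceeds its degrees of freedom, so the effect
sizes are clearly heterogeneous, even though the direction
of effect is perfectly consistent.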

4.2 Models for Synthesis of Data from Multiple 
Replication Studies 

Data across studies can be combined at the level
of either p-values (probability pooler methods) or
effect sizes (effect size meta-analysis) [12, 32, 61].
When p-values are combined, one should at a minimum
also take into account the direction of effects,
but the magnitude of the effects is not taken into
account. When effect sizes are used, several models
are available, depending on whether between-study
heterogeneity is taken into account and, if so, how.
In general, fixed effects approaches that ignore
between-study heterogeneity are better powered than
random effects approaches and thus more efficient for
discovery purposes. However, this comes at the cost of
an increased chance of false positives. For estimating
effects and predicting what effects might be expected
in future similar populations, random effects are
intuitively superior because they better capture the
extent of the uncertainty. Commonly, random effects are
reported with a 95% CI that captures the uncertainty
about the mean effect, but ideally one should also
examine the uncertainty of the distribution of effects
across populations. This is provided by the prediction
interval. An approximate $100(1-\alpha)\%$ prediction
interval for the effect in an unspecified study can be
obtained from the estimate of the mean effect
$\hat{\mu}$, its estimated standard error
$\mathrm{SE}(\hat{\mu})$ and the estimate of the
between-study variance $\hat{\tau}^2$ by

$$\hat{\mu} \pm t^{\alpha}_{k-2} \sqrt{\hat{\tau}^2 + \mathrm{SE}(\hat{\mu})^2},$$

where $t^{\alpha}_{k-2}$ is the $100(1-\alpha/2)\%$
percentile of the t-distribution with $k-2$ degrees of
freedom and $k$ is the number of studies [20]. It
follows that when an association has been probed in
only a few datasets, the prediction interval will be
wider than the respective confidence interval, even if
there is no demonstrable between-study variance (i.e.,
$\tau^2 = 0$). Table 1 summarizes some issues that
arise in selecting, interpreting and comparing the
properties and results of various commonly used
meta-analysis methods.
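The effect-size models and the prediction interval
described above can be sketched as follows. This is an
illustration under stated assumptions rather than a
reference implementation: the between-study variance is
estimated with the DerSimonian-Laird moment estimator
(one common choice; the text does not prescribe an
estimator), the t critical value is supplied by the
caller to keep the sketch free of dependencies, and the
study estimates are hypothetical.

```python
import math


def fixed_effect(effects, std_errors):
    """Inverse-variance weighted mean effect and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    mean = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return mean, math.sqrt(1.0 / sum(weights))


def random_effects_dl(effects, std_errors):
    """Random effects mean, its SE, and the DerSimonian-Laird tau^2."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    fe_mean, _ = fixed_effect(effects, std_errors)
    q = sum(w * (y - fe_mean) ** 2 for w, y in zip(weights, effects))
    scale = total - sum(w ** 2 for w in weights) / total
    tau2 = max(0.0, (q - (len(effects) - 1)) / scale)
    re_weights = [1.0 / (se ** 2 + tau2) for se in std_errors]
    mean = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    return mean, math.sqrt(1.0 / sum(re_weights)), tau2


def prediction_interval(mean, se, tau2, t_crit):
    """mean +/- t_crit * sqrt(tau2 + se^2); t_crit has k - 2 degrees of freedom."""
    half_width = t_crit * math.sqrt(tau2 + se ** 2)
    return mean - half_width, mean + half_width


# Hypothetical log odds ratios and standard errors from k = 5 studies.
effects = [0.10, 0.25, 0.05, 0.30, 0.18]
std_errors = [0.05, 0.06, 0.05, 0.07, 0.06]

fe_mean, fe_se = fixed_effect(effects, std_errors)
re_mean, re_se, tau2 = random_effects_dl(effects, std_errors)

# 97.5th percentile of the t-distribution with k - 2 = 3 degrees of freedom.
T_CRIT_3DF = 3.182
conf_int = (re_mean - 1.96 * re_se, re_mean + 1.96 * re_se)
pred_int = prediction_interval(re_mean, re_se, tau2, T_CRIT_3DF)

print(f"fixed effect:   {fe_mean:.3f} (SE {fe_se:.3f})")
print(f"random effects: {re_mean:.3f} (SE {re_se:.3f}), tau^2 = {tau2:.4f}")
print(f"95% CI:                  {conf_int[0]:.3f} to {conf_int[1]:.3f}")
print(f"95% prediction interval: {pred_int[0]:.3f} to {pred_int[1]:.3f}")
```

With only five studies the prediction interval is much
wider than the confidence interval, both because of the
added between-study variance and because the t critical
value with three degrees of freedom is large.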

5. REASONS FOR NONREPLICATION

[T]here are often two or more hypotheses
which account for all the known facts
on some subject, and although, in such
cases, men [sic] of science endeavour to
find facts which will rule out all the
hypotheses except one, there is no reason why
they should always succeed.
Bertrand Russell [54]

A variant observed to be associated with a trait
in an initial GWA study may not be associated with the
trait in subsequent studies, even though the original
association was (nearly) "genome-wide significant."
There are a number of potential reasons for this
nonreplication.

(a) The original observation was a false positive 
due to sampling error. This is the default explana- 
tion, until proven otherwise. This is more likely for 
associations that were not (or just barely) "genome- 
wide significant" than for observations that were ex- 
tremely statistically significant. 

(b) The follow-up study had insufficient power.
This problem can be avoided by ensuring the follow-up
study is large enough to reliably detect the observed
effect (after accounting for inflation due to
"winner's curse") [69, 70, 75, 76]. Moreover, if we
consider the cumulative evidence (both the original
data plus the follow-up data) as an updated
meta-analysis, the cumulative evidence may still pass
genome-wide significance or a sufficient Bayes Factor
threshold, even though the follow-up data are
not formally (highly) significant, when seen in
isolation.

Table 1

Different methods for meta-analysis in the genome-wide association setting

| Issues and caveats | P-value meta-analysis | Effect size meta-analysis: fixed effects | Effect size meta-analysis: random effects |
|---|---|---|---|
| Direction of effect is considered | In some methods | Yes | Yes |
| Effect size is considered | No | Yes | Yes |
| Summary p-value is obtained | Yes | Yes | Yes |
| Summary effect is obtained | No | Yes | Yes |
| Summary result can be converted to credibility based on priors for the anticipated effect sizes | No | Yes | Yes |
| Between-study heterogeneity can be taken into account | No | No | Yes |
| Between-study heterogeneity can be estimated/tested | No | Yes | Yes |
| Consensus on if/how datasets should be weighted | No | Yes | Yes |
| Commonly used weights | None, SQRT(N), N | Inverse variance | Inverse variance |
| Prior assumptions on the effect size can be used | No | In Bayesian meta-analysis | In Bayesian meta-analysis |
| Prior uncertainty on heterogeneity can be accommodated | No | No | In Bayesian meta-analysis |
| Prior uncertainty on the genetic model can be accommodated | No | N/A | In Bayesian meta-analysis |
| Normality assumptions typically made within each study | Yes | Yes | Yes |
| Normality assumptions within each study easily testable | Yes, rarely done | Yes, rarely done | Yes, rarely done |
| Normality assumptions for distribution of effects across studies easily testable | No effects assumed | Single common effect assumed (assumption may be visibly wrong) | Not easily testable |
| Heavy-tail alternative methods exist | No | Yes, rarely used | Yes, rarely used |
| Use with uncommon alleles (small genotype groups, or even zero allele counts in 2x2 tables) | Need to use exact methods | Quite robust | Between-study variance estimation unstable |
| Power for discovery | Good | Good | Less than others |
| False-positives from single biased dataset | Susceptible | Susceptible | Less susceptible |
| False-positives when evidence from small studies is most biased | Susceptible | Susceptible | More susceptible |
| False-positives when evidence from large studies is most biased | Susceptible | Susceptible | Less susceptible |
| Can predict range of effect sizes in future similar populations | No | Too narrow confidence intervals | Appropriate with predictive intervals |
| Can convey uncertainty for practical applications (e.g., to be used in clinical prediction test) | Useless | Inappropriate | Most appropriate with prediction intervals |

(c) The genotypic coding used in the initial study 
may not accurately reflect the true underlying asso- 
ciation, leading to a loss of power. Ideally the follow- 
up study should be well powered to detect associa- 
tions based on different genetic models (e.g., reces- 
sive, dominant) that are consistent with the results 
observed in the first study. 

(d) The variant may be a poor marker for the 
trait due to differences in linkage-disequilibrium 
structure between the studies. This is more likely 
if the study populations have different ethnic back- 
grounds. When discussing this as a possible rea- 
son for nonreplication, investigators should make a 
good-faith effort to provide empirical data on how 
linkage-disequilibrium patterns differ (e.g., using 
HapMap data) and how these differences would lead 
to inconsistencies across studies. 

(e) Differences in design or trait definition may 
lead to inconsistencies. See Sections 6.1 and 6.2 for 
examples of how different matching or ascertain- 
ment schemes can affect estimates of marker-trait 
association. Again, when citing this as a reason for 
nonreplication, investigators should as far as possi- 
ble present arguments for the likelihood and mag- 
nitude of differences due to design or measurement 
differences. 

(f) The absence of an association in the subse- 
quent studies may be due to true etiologic hetero- 
geneity. Sometimes, this may be driven by gene-gene 
or gene-environment interaction. If cases in the orig- 
inal study were required to have a family history of 
disease, for example, or required to have a relatively 
rare exposure profile (e.g., male lifetime never smok- 
ers), then subsequent studies that do not impose 
these restrictions may not see the association, if the 
association is restricted to subgroups with a par- 
ticular genetic or exposure background. However, 
to date, gene-gene and gene-environment interac- 
tions have been notoriously difficult to document 
robustly. 

For the last three explanations, it is useful to
clarify if the explanation was offered a posteriori
after observing the inconsistent results in different
studies. Post hoc explanations for subgroup
differences, interactions and effect modification may
be overfit to the observed data and may require
further prospective replication in further datasets
before they can be relied upon.
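Point (b) above, that cumulative evidence may pass
genome-wide significance even when the follow-up data are
not highly significant in isolation, can be illustrated
numerically. The sketch below assumes a fixed-effect
(inverse-variance) update and uses hypothetical estimates;
it applies no correction for "winner's curse" to the
discovery estimate.

```python
import math

GENOME_WIDE_ALPHA = 5e-8


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def fixed_effect_pool(estimates):
    """Pool (log odds ratio, standard error) pairs by inverse variance."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))


# Hypothetical (log odds ratio, standard error) from each stage.
discovery = (0.18, 0.032)
follow_up = (0.10, 0.040)

p_discovery = two_sided_p(discovery[0] / discovery[1])
p_follow_up = two_sided_p(follow_up[0] / follow_up[1])

pooled, pooled_se = fixed_effect_pool([discovery, follow_up])
p_cumulative = two_sided_p(pooled / pooled_se)

print(f"discovery p  = {p_discovery:.2e}")
print(f"follow-up p  = {p_follow_up:.2e}")
print(f"cumulative p = {p_cumulative:.2e}")
print(f"cumulative evidence genome-wide significant: "
      f"{p_cumulative < GENOME_WIDE_ALPHA}")
```

In this made-up example the follow-up estimate is smaller
than the discovery estimate and only nominally
significant, yet the pooled evidence is stronger than the
discovery evidence alone.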

6. THE WIDER PICTURE OF REPLICATION 
EFFORTS: CONSORTIA, DATA AVAILABILITY 
AND FIELD SYNOPSES 

With the recent successes of GWA studies, the
field has realized that increasingly large sample sizes
are required to identify and replicate the increasingly
small effect sizes at common variants that remain
undetected. Even wider networks will be required to
facilitate the study of variation at the lower end of
the frequency spectrum (be it single base changes, copy
number variants or otherwise). Collaboration and data
sharing are invaluable tools in achieving the necessary
sample sizes for well-powered replication studies. The
past few years have witnessed a rapid rise in the
formation of international consortia, and collaboration
has taken a prominent role in conducting research.
Consortia allow investigators to make some design
choices up front (if only deciding which SNPs to attempt
to replicate), and to work together to harmonize
phenotypes and analyses [71]. Several examples of
notable successes of consortium-coordinated efforts
have started to emerge in the literature [2, 10, 66,
67, 74].

In silico replication of association signals has been
further facilitated by initiatives making genetic
association study results and/or raw data publicly
available (or available through application to an access
committee), for example, the Wellcome Trust Case
Control Consortium (www.wtccc.org.uk), dbGaP
(http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap)
and the European Genotype Archive (EGA,
http://www.ebi.ac.uk/ega). Several emerging
considerations, for example, with respect to the
anonymity of data [25], avenues for communication
between primary investigators and secondary users
to facilitate a better understanding of the datasets
and their appropriate uses, and suitable accreditation
of involved parties, require resolution in order
to optimize the use of publicly available raw data.

Replication undoubtedly constitutes an evolving
practice. The need to incorporate new data arising
from further GWA scans, other replication studies,
meta-analyses or all of the above leads to the
emerging paradigm of conglomerate analyses. Field
synopses, for example, are efforts to integrate data
from diverse sources (GWA studies, consortia, single
published studies) in the published literature
and to make them publicly available in electronic
databases that can be updated. Examples include
the field synopses on Alzheimer's disease (AlzGene
database), schizophrenia (SzGene database) and
DNA repair genes [3, 63]. The results of the
meta-analyses on the accumulated data can then also be
graded for their epidemiological credibility, for
example, as proposed by the Venice criteria [31].

Table 2

Cumulative power to detect association ($\alpha = 5 \times 10^{-8}$) at a risk allele with frequency 0.20 and 0.40, and allelic odds ratios of 1.1 and 1.2, given sample sizes for the WTCCC, DGI and FUSION studies

| Studies | Risk allele frequency | Allelic odds ratio | Cumulative n cases | Cumulative n controls | Power |
|---|---|---|---|---|---|
| WTCCC | 0.20 | 1.10 | 1924 | 2938 | 0.0002 |
| WTCCC + DGI | 0.20 | 1.10 | 3388 | 4405 | 0.0011 |
| WTCCC + DGI + FUSION | 0.20 | 1.10 | 4549 | 5579 | 0.0033 |
| WTCCC | 0.40 | 1.10 | 1924 | 2938 | 0.0007 |
| WTCCC + DGI | 0.40 | 1.10 | 3388 | 4405 | 0.0054 |
| WTCCC + DGI + FUSION | 0.40 | 1.10 | 4549 | 5579 | 0.0166 |
| WTCCC | 0.20 | 1.20 | 1924 | 2938 | 0.0333 |
| WTCCC + DGI | 0.20 | 1.20 | 3388 | 4405 | 0.2078 |
| WTCCC + DGI + FUSION | 0.20 | 1.20 | 4549 | 5579 | 0.4426 |
| WTCCC | 0.40 | 1.20 | 1924 | 2938 | 0.1336 |
| WTCCC + DGI | 0.40 | 1.20 | 3388 | 4405 | 0.5468 |
| WTCCC + DGI + FUSION | 0.40 | 1.20 | 4549 | 5579 | 0.8219 |

6.1 Example from the Field of Type 2 Diabetes 

Researchers in the field of Type 2 diabetes (T2D)
genetics were among the first to lead the way in
distributed collaborative networks, exemplified by
early efforts such as the International Type 2
Diabetes Linkage Analysis Consortium and the
International Type 2 Diabetes 1q Consortium [11, 16,
72]. The advent of GWA scans was met by
pre-publication data sharing between three large-scale
studies, the WTCCC, DGI and FUSION scans [56,
57, 59, 73], leading to the formation of the DIAGRAM
Consortium (Diabetes Genetics Replication
and Meta-analysis). By exchanging information on
top signals, the three studies obtained in silico
replication of individual scan findings and then
further pursued de novo replication in additional sets
of independent samples. This endeavor additionally
highlighted examples of statistical heterogeneity
across the studies, notably with respect to one of the
WTCCC study's strongest signals, residing within
the FTO gene [13]. This inconsistency in observed
associations could be ascribed to study design and,
specifically, to matching cases and controls for BMI
(DGI study). The FTO signal was quickly identified
as the first robustly replicating association with
obesity, mediating its effect on T2D through BMI. A
truly genome-wide meta-analysis of the three scans
ensued, with large-scale replication efforts in
independent datasets of T2D cases and controls, all of
European origin. This effort led to the identification
of further novel T2D susceptibility loci [74]. Table 2
demonstrates the gains in power afforded by increasing
sample size from a single scan to the synthesis
of all three studies for a realistic common complex
disease susceptibility locus.
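The gain in power with cumulative sample size can be
approximated with a short calculation. The sketch below
is one possible approach, not necessarily the method used
for Table 2: it assumes a two-sided allelic test based on
a normal approximation to the log odds ratio, and it
treats the stated risk allele frequency as the frequency
in controls. Under these assumptions it gives values close
to those in Table 2 (roughly 0.82 for the last row).

```python
import math
from statistics import NormalDist


def allelic_power(n_cases, n_controls, risk_allele_freq, odds_ratio, alpha=5e-8):
    """Approximate power of a two-sided allelic test at significance level alpha."""
    p_controls = risk_allele_freq  # assumption: frequency given for controls
    # Risk allele frequency in cases implied by the allelic odds ratio.
    p_cases = odds_ratio * p_controls / (1.0 + p_controls * (odds_ratio - 1.0))
    # Large-sample variance of the log odds ratio from the 2 x 2 allele table
    # (each person contributes two alleles).
    variance = (1.0 / (2 * n_cases * p_cases * (1.0 - p_cases))
                + 1.0 / (2 * n_controls * p_controls * (1.0 - p_controls)))
    expected_z = math.log(odds_ratio) / math.sqrt(variance)
    normal = NormalDist()
    z_crit = -normal.inv_cdf(alpha / 2.0)
    return normal.cdf(expected_z - z_crit) + normal.cdf(-expected_z - z_crit)


# Cumulative sample sizes (cases, controls) taken from Table 2.
sample_sizes = {
    "WTCCC": (1924, 2938),
    "WTCCC + DGI": (3388, 4405),
    "WTCCC + DGI + FUSION": (4549, 5579),
}

powers = {}
for label, (n_cases, n_controls) in sample_sizes.items():
    powers[label] = allelic_power(n_cases, n_controls, 0.40, 1.20)
    print(f"{label:22s} power = {powers[label]:.3f}")
```

The steep rise in power across the three rows reflects
how far the threshold of 5 x 10^-8 lies in the tail: a
modest increase in the expected test statistic moves a
large share of its distribution past the threshold.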

6.2 Anthropometrics and the Analysis of 
"Secondary Traits" 

The meta-analyses of body mass index and height
conducted by the Genetic Investigation of
ANthropometric Traits (GIANT) consortium raised
additional issues [36, 43, 67]. Specifically, unlike
the diabetes consortia, where each participating study
was designed with diabetes as its primary outcome, the
studies involved in GIANT were not originally designed
to study determinants of BMI and height; rather, they
were originally case-control studies of diabetes,
prostate and breast cancers, and other diseases
[15, 40]. In principle, if the studied trait is
associated with disease risk, then conditioning on
case-control status can create a spurious association
between a marker and the trait. In practice, only a
small number of markers will have an inflated Type
I error rate (namely, those markers that are
associated with disease risk but not directly with the
secondary trait), and the magnitude of the inflation
depends on both the strength of the association
between the secondary trait and disease (which could
be modest or controversial, as in the case of smoking
and breast or prostate cancer, or quite strong,
as in the case of BMI and T2D or smoking and lung
cancer) and the strength of the association between
the marker and disease (typically relatively weak)
[15, 48]. Moreover, the risk of false positives may be
further ameliorated by diversity of designs among
the participating studies: some may have originally
been case-control studies of different diseases,
others may have been cohort or cross-sectional
studies. Although there are analytic methods that can
eliminate spurious association or bias due to
case-control ascertainment in particular situations and
under particular assumptions [40, 48], these should
not replace careful consideration of potential biases
and evaluation of heterogeneity in effect measures
across studies with different designs.